Update 3 UDFs: Percentile, Quantile and Cluster #17375
suyx1999 wants to merge 10 commits into apache:master from
Pull request overview
This PR updates the library UDF suite by fixing correctness issues in Percentile/Quantile implementations and introducing a new cluster UDTF for subsequence (window-based) clustering in the dlearn module.
Changes:
- Fixes Percentile-related edge cases (e.g., out-of-bounds handling in GK sketch compression; discrete nearest-rank percentile indexing).
- Adjusts Quantile UDF value encoding/decoding logic for KLL-based quantile computation.
- Adds `UDTFCluster` plus clustering utilities (k-means, k-shape, medoid-shape), integration tests, and registration scripts.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/util/GKArray.java | Prevents OOB during merge/compress when incoming entries are exhausted. |
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/util/ExactOrderStatistics.java | Fixes percentile indexing via discrete nearest-rank; updates class documentation. |
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/UDAFQuantile.java | Updates numeric-to-long encoding logic used by the quantile sketch and output casting. |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/MedoidShape.java | Adds medoid-shape clustering implementation (coarse k-means + greedy representative selection). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/KShape.java | Adds k-Shape clustering implementation (SBD assignment + SVD centroid update). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/KMeans.java | Adds univariate-window k-means implementation for subsequences. |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/ClusterUtils.java | Adds shared utilities (z-normalization, Euclidean distance, FFT-based NCC/SBD). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/UDTFCluster.java | Introduces cluster UDTF for windowing a single series and clustering windows; supports label/centroid output. |
| library-udf/src/assembly/tools/register-UDF.sh | Registers the new cluster UDF in the Unix registration script. |
| library-udf/src/assembly/tools/register-UDF.bat | Registers the new cluster UDF in the Windows registration script. |
| integration-test/src/test/java/org/apache/iotdb/libudf/it/dlearn/DLearnIT.java | Adds cluster UDF integration tests and a toy series dataset. |
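The table lists z-normalization and an NCC-based shape distance (SBD) among the shared utilities. As a rough illustration only, the sketch below computes both directly, without FFT. The class and method names are hypothetical, the use of the population standard deviation and the handling of constant windows are assumptions, and none of this is taken from `ClusterUtils` itself.

```java
final class ShapeDistanceSketch {
  /** Z-normalizes a window using the population standard deviation (an assumption). */
  static double[] zNormalize(double[] x) {
    double mean = 0;
    for (double v : x) mean += v;
    mean /= x.length;
    double var = 0;
    for (double v : x) var += (v - mean) * (v - mean);
    double std = Math.sqrt(var / x.length);
    double[] out = new double[x.length];
    if (std == 0) return out; // constant window: return all zeros (assumed convention)
    for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / std;
    return out;
  }

  /** SBD(x, y) = 1 - max over shifts of cross-correlation / (||x|| * ||y||), computed in O(m^2). */
  static double sbd(double[] x, double[] y) {
    int m = x.length;
    double normX = 0, normY = 0;
    for (int i = 0; i < m; i++) {
      normX += x[i] * x[i];
      normY += y[i] * y[i];
    }
    double denom = Math.sqrt(normX * normY);
    if (denom == 0) return 1.0; // undefined for all-zero input; 1.0 is an assumed convention
    double best = Double.NEGATIVE_INFINITY;
    for (int shift = -(m - 1); shift <= m - 1; shift++) {
      double cc = 0;
      for (int i = 0; i < m; i++) {
        int j = i + shift;
        if (j >= 0 && j < m) cc += x[j] * y[i];
      }
      best = Math.max(best, cc);
    }
    return 1.0 - best / denom;
  }

  public static void main(String[] args) {
    double[] a = {0, 1, 0, 0};
    double[] b = {0, 0, 1, 0}; // same shape as a, shifted by one sample
    System.out.println(sbd(a, b)); // close to 0: the shift is absorbed
    System.out.println(java.util.Arrays.toString(zNormalize(new double[] {1, 2, 3})));
  }
}
```

A shifted copy of the same shape gets a distance near zero here, whereas the Euclidean distance between `a` and `b` would be non-zero; that difference is what makes a shape-based distance attractive for subsequences.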
    private long dataToLong(double res) {
      switch (dataType) {
        case INT32:
          return (int) data;
          return (int) res;
        case FLOAT:
          result = Float.floatToIntBits((float) data);
          return (float) data >= 0f ? result : result ^ Long.MAX_VALUE;
          float f = (float) res;
          long flBits = Float.floatToIntBits(f);
          return f >= 0f ? flBits : flBits ^ Long.MAX_VALUE;
        case INT64:
          return (long) data;
          return (long) res;
        case DOUBLE:
          result = Double.doubleToLongBits((double) data);
          return (double) data >= 0d ? result : result ^ Long.MAX_VALUE;
        case BLOB:
        case BOOLEAN:
        case STRING:
        case TEXT:
        case DATE:
        case TIMESTAMP:
          long d = Double.doubleToLongBits(res);
          return res >= 0d ? d : d ^ Long.MAX_VALUE;
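The sign-dependent XOR in the snippet above is a common trick for mapping floating-point values to longs whose signed order matches the numeric order. A minimal sketch of the DOUBLE case follows; the class name and the `decode` helper are hypothetical and not taken from `UDAFQuantile`, and signed zero and NaN are deliberately left out.

```java
final class OrderPreservingBits {
  /** Maps a double to a long whose signed order matches the numeric order of the input. */
  static long encode(double v) {
    long bits = Double.doubleToLongBits(v);
    // Non-negative doubles already sort correctly as signed longs.
    // Negative doubles sort in reverse, so flip the 63 non-sign bits.
    return v >= 0d ? bits : bits ^ Long.MAX_VALUE;
  }

  /** Inverse of encode: the sign bit is untouched, so it tells us whether to flip back. */
  static double decode(long code) {
    return Double.longBitsToDouble(code >= 0 ? code : code ^ Long.MAX_VALUE);
  }

  public static void main(String[] args) {
    double[] sorted = {-1e300, -2.5, -1.0, 0.0, 1.0, 2.5, 1e300};
    boolean ordered = true;
    for (int i = 1; i < sorted.length; i++) {
      ordered &= encode(sorted[i - 1]) < encode(sorted[i]);
    }
    System.out.println("order preserved: " + ordered);
    System.out.println("round trip of -2.5: " + decode(encode(-2.5)));
  }
}
```

Because the sign bit is never flipped, the decoder can tell from the sign of the encoded long which branch was taken during encoding.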
dataToLong(double res) still routes INT64 values through a double and then casts back to long. For INT64 values beyond 2^53, this loses integer precision and can change the quantile result ordering. Consider ingesting INT32/INT64 using row.getInt(0) / row.getLong(0) (or otherwise passing the original integer type into dataToLong) so the sketch sees exact values.
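The precision concern can be reproduced in isolation. The sketch below is a standalone illustration with a hypothetical class name; it does not touch the UDF or the sketch, and only shows what a long-to-double-to-long round trip does near 2^53.

```java
final class Int64ThroughDouble {
  /** Round-trips a long through double, as happens when INT64 input is read as a double. */
  static long roundTrip(long v) {
    return (long) (double) v;
  }

  public static void main(String[] args) {
    long exact = (1L << 52) + 1;  // still exactly representable as a double
    long lossy = (1L << 53) + 1;  // 9007199254740993 has no exact double representation
    System.out.println(roundTrip(exact) == exact);       // true
    System.out.println(roundTrip(lossy) == lossy);       // false
    System.out.println(roundTrip(lossy) == (1L << 53));  // true: collapses onto its neighbor
  }
}
```

Two distinct INT64 inputs can therefore become the same value before the sketch ever sees them, which is why the comment suggests reading integer columns with their native accessor.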
library-udf/src/main/java/org/apache/iotdb/library/dlearn/UDTFCluster.java (resolved)
library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/MedoidShape.java (resolved)
    /** Discrete nearest-rank index into sorted data of length {@code n}; see class Javadoc. */
    private static int discreteNearestRankIndex(int n, double phi) {
      int idx = (int) Math.ceil(n * phi) - 1;
      return Math.max(0, Math.min(n - 1, idx));
    }

    public static float getPercentile(FloatArrayList nums, double phi) {
      if (nums.isEmpty()) {
        throw new NoSuchElementException();
      } else {
        nums.sortThis();
        return nums.get((int) Math.ceil(nums.size() * phi));
        return nums.get(discreteNearestRankIndex(nums.size(), phi));
      }
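To see why the index change matters, the sketch below compares the two index formulas from the snippet above on a five-element input. `discreteNearestRankIndex` is copied from the diff; the class name and `oldIndex` are hypothetical names introduced only for this comparison.

```java
final class NearestRankDemo {
  /** Index formula from the diff: ceil(n * phi) - 1, clamped to [0, n - 1]. */
  static int discreteNearestRankIndex(int n, double phi) {
    int idx = (int) Math.ceil(n * phi) - 1;
    return Math.max(0, Math.min(n - 1, idx));
  }

  /** Index formula used before the fix: ceil(n * phi), unclamped. */
  static int oldIndex(int n, double phi) {
    return (int) Math.ceil(n * phi);
  }

  public static void main(String[] args) {
    int n = 5;
    for (double phi : new double[] {0.0, 0.5, 1.0}) {
      System.out.println(
          "phi=" + phi + " old=" + oldIndex(n, phi) + " new=" + discreteNearestRankIndex(n, phi));
    }
    // phi=1.0 gives old=5, which is out of bounds for a list of size 5; new=4 is the last element.
  }
}
```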
PR description says unit tests were added for the Percentile and Quantile UDF fixes, but this change set only adds cluster-related IT coverage (and no new tests exercising the updated Percentile/Quantile edge cases). If tests were intended, they appear to be missing from the PR.
Removed the mention of tests from the PR description. Percentile and Quantile are already covered by existing integration tests.
    case DOUBLE:
      collector.putDouble(0, res);
      break;
    case TIMESTAMP:
    case DATE:
    case TEXT:
    case STRING:
    case BLOB:
    case BOOLEAN:
    default:
      break;
In terminate(), the rank-to-index conversion appears off by one: findMinValueWithRank((long) (rank * sketch.getN())) will pass k == n when rank == 1, but getApproxRank(v) counts values strictly < v, so the valid target ranks are typically in [0, n-1] (nearest-rank would use ceil(n*rank)-1). As-is, rank=1 can return a value larger than the max input (and other ranks are shifted).
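The arithmetic behind this comment can be checked on exact data. The sketch below is not the KLL sketch and does not call `findMinValueWithRank` or `getApproxRank`; it uses a sorted array as a stand-in, where the element with exactly `k` strictly smaller values is `sorted[k]`. The class and method names are hypothetical.

```java
final class RankTargetDemo {
  /** Target rank as described in the comment: truncation of rank * n. */
  static long truncatedTarget(long n, double rank) {
    return (long) (rank * n);
  }

  /** Nearest-rank alternative from the comment: ceil(n * rank) - 1, clamped to [0, n - 1]. */
  static long nearestRankTarget(long n, double rank) {
    long k = (long) Math.ceil(n * rank) - 1;
    return Math.max(0, Math.min(n - 1, k));
  }

  public static void main(String[] args) {
    double[] sorted = {10, 20, 30, 40};
    long n = sorted.length;
    for (double rank : new double[] {0.25, 0.5, 1.0}) {
      long t = truncatedTarget(n, rank);
      long r = nearestRankTarget(n, rank);
      String truncatedValue = t < n ? String.valueOf(sorted[(int) t]) : "no such element";
      System.out.println(
          "rank=" + rank + " truncated k=" + t + " -> " + truncatedValue
              + ", nearest-rank k=" + r + " -> " + sorted[(int) r]);
    }
  }
}
```

With `rank = 1` the truncated target equals `n`, for which no input element has that many strictly smaller values, while the nearest-rank target stays at `n - 1`.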
    int sampleCount = Math.max(1, (int) (r * n));
    sampleCount = Math.min(sampleCount, pool.size());
    Collections.shuffle(pool, rnd);
    List<Integer> sampleIdx = pool.subList(0, sampleCount);
MedoidShape.fastKShape() uses Collections.shuffle(pool, rnd) with a default new Random() seed, so results can vary between runs (especially when multiple candidates tie on delta, where the shuffled iteration order decides the winner). For a database UDF this can make repeated queries non-reproducible; consider making the default selection deterministic (e.g., fixed seed, no shuffle when sampling all candidates, or deterministic tie-breaking).
Clustering techniques inherently involve randomness, which is unavoidable.
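For reference, one of the options raised in the review (a fixed seed) keeps the random sampling step while making it repeatable. The sketch below mirrors the sampling lines from the snippet under stated assumptions: the class name, the `seed` parameter, and using the pool size in place of `n` are all hypothetical and not part of `MedoidShape`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SeededSampling {
  /** Samples roughly r * poolSize candidate indices; the same seed always yields the same sample. */
  static List<Integer> sample(int poolSize, double r, long seed) {
    List<Integer> pool = new ArrayList<>();
    for (int i = 0; i < poolSize; i++) {
      pool.add(i);
    }
    int sampleCount = Math.max(1, (int) (r * poolSize));
    sampleCount = Math.min(sampleCount, pool.size());
    Collections.shuffle(pool, new Random(seed)); // seeded: iteration order is reproducible
    return new ArrayList<>(pool.subList(0, sampleCount));
  }

  public static void main(String[] args) {
    List<Integer> first = sample(10, 0.5, 42L);
    List<Integer> second = sample(10, 0.5, 42L);
    System.out.println(first.equals(second)); // true: identical seed, identical sample
  }
}
```

Whether a fixed default seed is appropriate for the UDF is a design decision for the authors; the sketch only shows that randomness and reproducibility are not mutually exclusive.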
…/cluster/MedoidShape.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…Cluster.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Update 3 UDFs: Percentile, Quantile and Cluster
Fix issues in Percentile and Quantile UDFs
Percentile
Quantile
Update Cluster UDF
`l`, and clusters those subsequences to discover local patterns or segment structure.

Input series
- INT32/INT64/FLOAT/DOUBLE.
- ⌊n/l⌋ windows are used, where `n` is the number of valid points.

Parameters

| Parameter | Default | Description |
|---|---|---|
| `l` | | `l` consecutive samples. |
| `k` | | |
| `method` | `kmeans` | `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to k-means if omitted. |
| `norm` | `true` | If `true`, each subsequence is standardized before clustering. |
| `maxiter` | `200` | |
| `output` | `label` | `label`: one cluster id per window; `centroid`: concatenate the `k` centroid vectors in cluster order. |
| `sample_rate` | `0.3` | Applies to `method=medoidshape`; must be in `(0, 1]`. |

`method` details
- `medoidshape`: coarse k-means into min(2k, number of windows) clusters, then greedy selection of `k` representative subsequences; `sample_rate` controls how many candidates are sampled each round.

Output series
Controlled by `output`:
- `output=label` (default): INT32. ⌊n/l⌋.
- `output=centroid`: DOUBLE. k × l: for clusters 0 → k−1, emit the `l` components of each centroid in order (concatenated).
- 0, 1, 2, … (placeholders only, no physical time meaning).
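The window count and the centroid output layout described above can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and treating the windows as non-overlapping with trailing points dropped is an assumption inferred from the ⌊n/l⌋ count, not something confirmed by `UDTFCluster`.

```java
final class WindowingSketch {
  /** Splits a series into floor(n / l) consecutive windows of length l (trailing points dropped). */
  static double[][] toWindows(double[] series, int l) {
    int count = series.length / l;
    double[][] windows = new double[count][l];
    for (int i = 0; i < count; i++) {
      System.arraycopy(series, i * l, windows[i], 0, l);
    }
    return windows;
  }

  /** Concatenates k centroids of length l into k * l values, cluster 0 first. */
  static double[] flattenCentroids(double[][] centroids) {
    int l = centroids.length == 0 ? 0 : centroids[0].length;
    double[] out = new double[centroids.length * l];
    for (int c = 0; c < centroids.length; c++) {
      System.arraycopy(centroids[c], 0, out, c * l, l);
    }
    return out;
  }

  public static void main(String[] args) {
    double[][] windows = toWindows(new double[] {1, 2, 3, 4, 5, 6, 7}, 3);
    System.out.println(windows.length); // 2 windows; the 7th point is unused
    double[] flat = flattenCentroids(new double[][] {{1, 2}, {3, 4}});
    System.out.println(flat.length); // 4 values for k = 2, l = 2
  }
}
```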